Skip to content

perf: build partition filter with balanced tree to avoid RecursionError#3264

Merged
geruh merged 1 commit intoapache:mainfrom
yingjianwu98:yingjianw/build_balence_tree_for_partition_filter
Apr 22, 2026
Merged

perf: build partition filter with balanced tree to avoid RecursionError#3264
geruh merged 1 commit intoapache:mainfrom
yingjianwu98:yingjianw/build_balence_tree_for_partition_filter

Conversation

@yingjianwu98
Copy link
Copy Markdown
Contributor

@yingjianwu98 yingjianwu98 commented Apr 21, 2026

Rationale for this change

We recently encountered recursion depth errors in _build_partition_predicate when performing overwrites involving a large number of partitions.

One way to address this is to switch to a balanced-tree implementation, which is already used in several other parts of the PyIceberg codebase to avoid recursion issues. This change aligns _build_partition_predicate with those existing patterns. For reference, see for example:
#1830

Are these changes tested?

Yes, with existing tests and also added another test to ensure it works with large number of input partitions

Are there any user-facing changes?

No

Copy link
Copy Markdown
Member

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for raising @yingjianwu98! I was able to test this locally and run into some recursion errors. Just left a non blocking comments.

Comment thread pyiceberg/table/__init__.py Outdated
match_partition_expression = And(match_partition_expression, predicate)
expr = Or(expr, match_partition_expression)
return expr
partition_fields = [schema.find_field(f.source_id).name for f in spec.fields]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't need to make this style diff so we don't lose blame history.

Comment thread tests/table/test_init.py Outdated
def test_build_large_partition_predicate(table_v2: Table) -> None:
"""A left-folded Or chain over 5000 records would be depth-5000 and crash bind()
(Python's default recursion limit is ~1000). The balanced tree has depth ~14."""
from pyiceberg.expressions.visitors import bind
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we import these at the class level?

Comment thread tests/table/test_init.py Outdated


def test_build_large_partition_predicate(table_v2: Table) -> None:
"""A left-folded Or chain over 5000 records would be depth-5000 and crash bind()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: idk how I feel about describing the old behavior in the docstring. I think it would be better to describe the current situation, i.e.

"""Partition predicates over many records must bind without RecursionError."""

But at that point the test name already says that and makes me wonder if the docstring is needed maybe we could get rid of it. Again non-blocking and thinkning out loud.

@yingjianwu98 yingjianwu98 force-pushed the yingjianw/build_balence_tree_for_partition_filter branch from 28b54d6 to d21f99a Compare April 22, 2026 02:59
@yingjianwu98
Copy link
Copy Markdown
Contributor Author

Thanks for the review @geruh and I have updated the comments accordingly!

Copy link
Copy Markdown
Contributor

@gabeiglio gabeiglio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@geruh geruh merged commit df258f5 into apache:main Apr 22, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants